Multiple phenotype association tests based on sliced inverse regression

Background Joint analysis of multiple phenotypes in studies of biological systems such as Genome-Wide Association Studies is critical to revealing the functional interactions between various traits and genetic variants, but growth of data in dimensionality has become a very challenging problem in the widespread use of joint analysis. To handle the excessiveness of variables, we consider the sliced inverse regression (SIR) method. Specifically, we propose a novel SIR-based association test that is robust and powerful in testing the association between multiple predictors and multiple outcomes. Results We conduct simulation studies in both low- and high-dimensional settings with various numbers of Single-Nucleotide Polymorphisms and consider the correlation structure of traits. Simulation results show that the proposed method outperforms the existing methods. We also successfully apply our method to the genetic association study of ADNI dataset. Both the simulation studies and real data analysis show that the SIR-based association test is valid and achieves a higher efficiency compared with its competitors. Conclusion Several scenarios with low- and high-dimensional responses and genotypes are considered in this paper. Our SIR-based method controls the estimated type I error at the pre-specified level \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha $$\end{document}α.


Introduction
In recent biomedical research, Genome-Wide Association Studies (GWAS) often requires the simultaneous consideration of multiple phenotypes.It has been shown that jointly analyzing the multiple phenotypes together can increase statistical power to detect genetic variants [1,2].Introducing more information through the joint analysis will benefit revealing the complex relationship that may be undiscovered by the single phenotype analysis [3].Meanwhile, the progressive improvements in data collection techniques have made it possible to measure more types of high-dimensional data on the same subject.So far, the common strategies used to detect genetic associations in the joint analysis of multiple phenotypes can be roughly classified into several categories, such as regression model-based methods, nonparametric association methods, and p value correction methods.The regression model-based methods mainly exploit multivariate regression models [4][5][6], generalized estimating equations (GEEs), and mixed effects models [7][8][9].As functional regression models perform well in most cases, Chui et al. [10] extended them to meta-analysis of pleiotropy traits, and Wang et al. [11] developed multivariate functional linear models and hypothesis testing procedure to test the association between multiple quantitative traits and multiple genetic variants in one genetic region.As a representative of the nonparametric association method, Zhang et al. [12] tested any hybrid of dichotomous, ordinal, and quantitative traits based on a generalization of Kendall's tau.Zhu et al. [13] extend their method to accommodate covariates and proposed a nonparametric covariate-adjusted association test.Among the representative p value correction methods, one approach is Fisher's combined method [14], which integrated the results from standard univariate analysis p value, and has been extended to dependent univariate test.Another approach called the minimum of the p value (Minp) method [15] has been applied to independent test studies to improve power when a SNP affects only a very small number of multiple phenotypes, but it is less powerful for denser signals.Sluis et al. [16] proposed the TATES method, which has good power in the presence of very few association signals but can lose its dominance otherwise.
Nevertheless, these methods focus only on the low-dimensional or moderate numbers of phenotypes.To this end, several dimension reduction methods, including principal component (PC) analysis, have been developed to reduce the high dimensionality of the phenotypes.Liu and Li [2] proposed the PC-based tests to take the high dimensional phenotypes into account and proved how to combine PCs together to achieve better power.But in fact, the PC-based tests consider only a single SNP, which makes it impossible to directly extend these tests to study the association between high dimensional phenotypes and multiple genotypes.Actually, with the development of next-generation sequencing technologies, recent GWAS usually collects high-dimensional SNPs and phenotypes.The implementation of association study often encounters other difficulties associated with the extremely high computational burden.
As another attempt to cope with the excessiveness of variables, Cook [17] introduced the idea of sufficient dimension reduction (SDR), which assumes that the response variable relates to only a few linear combinations of the many covariates.In this paper, we intend to reduce the dimension of the multivariate phenotype y without loss of informa- tion about the multiple genotypes g based on the idea of SDR, where the dimension of y could be large.To this end, we use the sliced inverse regression (SIR) proposed by Li [18] to seek the effective dimension-reduction (e.d.r) direction and propose a SIR-based testing method for genetic association with multiple phenotypes.Different from the principal component analysis, the motivation behind the SIR is to preserve regression information during carrying out dimension reduction of multivariate phenotype, so that the resulting variates capture important features of the regression relationship between the multivariate phenotypes and multiple phenotypes.The simulation studies illustrate that the type I error rates of our proposed tests are well-controlled and that the power is robust and powerful.We also apply the proposed SIR-based test to a real dataset, the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and successively identify the new genetic variants.

Data structure and linear regression
Suppose that data are collected from a population-based sequencing study with n independent individuals.For each individual, we observe q disease phenotypes and genotypes at k SNPs.Let the phenotype vector and genotype vector be y = (y 1 , . . ., y q ) T and g = (g 1 , . . ., g k ) T , respectively.Then, the observations of genotypes and trait measure- ments for n individuals are denoted as an n × k matrix G = (g 1 , . . ., g n ) T and an n × q matrix Y = (y 1 , . . ., y n ) T , respectively.Here, notations of g i and y i refer to the instances of g and y for i-th individual ( i = 1, . . ., n ).In this study, the q could be large, so detect- ing disease-associated genetic variants with large q is very challenging.In addition, the effects of the correlation between phenotypes and the direction of genetic effects should be considered.For brevity, we focus on the most popular continuous phenotypes only and consider a multivariate linear regression model with large q.
To describe the relationship between the phenotype y and the genotype g , we propose the following linear model: where B = (β T 1 . . ., β T k ) T is a k × q matrix of the regression coefficients, β j = (β j1 , . . ., β jq ) is a q-dimensional row vector of regression coefficients for the j-th SNP, which represents the effects of the j-th SNP on q phenotypes, and ε is the q-dimen- sional error vector with zero mean and true covariance matrix, which is often unknown.Here, the intercepts are omitted with y being properly centered.We can rewrite the above model in the following matrix form: where E = (ε 1 , . . ., ε n ) T is the n × q error matrix with ε i being the q-dimensional error vector for i-th individual.
Our primary interest lies in testing whether the genetic markers g are associated with the traits y .To address this problem, two strategies are commonly considered.Firstly, test- ing the effects of the j-th SNP on q phenotypes is equivalent to testing the null hypothesis H 0 : β j = 0 against the alternative hypothesis H 1 that at least one element of β j is not equal to zero.In this case, the Wald-type statistic T 1 = βj Cov( βj ) −1 βT j is adopted, where βj is the maximum likelihood estimator (MLE) of β j and Cov( βj ) is its covariance matrix.The test statistic T 1 has q degrees of freedom.When q is large and heterogeneous effects exist, especially when a variant only affects a subset of traits, the test statistic may be less powerful due to the large degrees of freedom.Furthermore, conducting the association study with multiple tests often results in a significant loss of statistical power due to a large number of comparisons.The second strategy involves considering the association between all k SNPs and q phenotoypes, which is equivalent to testing the null hypothesis (1) 1 , where B1 is the MLE of B 1 = (β 11 , . . ., β 1q , β 21 , . . ., β 2q , . . ., β k1 , . . ., β kq ) .The test statistic T 2 has kq degrees of freedom and the implementation of the association study with highdimensional phenotypes often encounters other difficulties concerning the extremely high computational burden.Since both T 1 and T 2 have their own disadvantages for large degrees of freedom, a common solution is to reduce the dimensionality of responses and/or predictors.In the following subsections, we present a SIR-based dimension reduction of y and test procedures with reduced phenotypes y.

SIR-based dimension reduction of y
As the dimension of phenotype y is very large, it is highly desirable that interesting fea- tures of high-dimensional data are retrievable from low-dimensional projections.PCA is perhaps the most well-known method for reducing dimensionality.But since the procedure is carried out without using the predictor variable, certain interesting regression variables may be lost in the reduction process and hardly capture important features of the regression relationship between response variables and predictors.Another attempt to cope with the excessiveness of variables is the SDR approach, which assumes that only a few linear combinations of original variables are sufficient to reveal the information within them without changing their explanatory effect on the response variable.Identifying these linear combinations is the goal of dimension reduction.To this end, many authors have utilized the so-called sliced inverse regression (SIR) method proposed by [18], which focuses on the inverse regression method by reversing the relation between the response and predictor variables to benefit from the response variable being, usually, of lower dimension than predictor vector.Here, we adopt the SIR method to reduce the dimension of the multivariate response y without loss of information about the multiple genotypes g.
To better understand the SIR method, we consider the following forward regression model: where S 1 , . . ., S d are unknown e.d.r directions, d is the number of dimensions one want to achieve, ǫ is independent of y , and f is an arbitrary unknown function on R d+1 .When the model (3) holds, the q-dimensional y can be projected onto the d-dimen- sional subspace with d ≪ q , so that interesting features of the high-dimensional y are compressed by low-dimensional projections.If g is statistically independent of y when S T m y, m = 1, . . ., d , are given, it is sufficient to focus only on the d reduced variables S T m y 's for studying the relationship between g and y .At this point, the column space of a q × d matrix S = (S 1 , . . ., S d ) becomes a SDR subspace.
To reduce the dimension as much as possible, we are often interested in the SDR subspace with the smallest dimension.Under mild conditions [17,19], the intersection of all SDR subspaces is still an SDR subspace, and the smallest SDR subspace is called the central subspace.For notational simplicity, in the following, we assume the central subspace to be estimated is spanned by a q × d 0 basis matrix, denoted by (3) g = f (S T 1 y, S T 2 y, . . ., S T d y, ǫ), S 0 = S 1 , . . ., S d 0 .If we further assume that y has been standardized to z , under a linearity condition that E(z | S T 0 z) is linear in S T 0 z , it is guaranteed that the E(z | g) belong to the space spaned by S 1 , . . ., S d 0 [18,20].Then, we can estimate the central subspace by applying a principal component analysis to the random vector E(z | g) , following the approach proposed by [18].Equivalently, we can derive a basis of central subspace by solving the solution of which is formed by the d 0 leading eigenvectors of where T  In this study, our concern is focused on testing the association between marker genes and multiple traits.However, under the null hypothesis, the traits and genes are independent.In this case, and the estimation of Cov[E(z | g)] will be very small and close to 0 in actual situations, then it becomes challenging to directly apply the PCA on the Cov[E(z | g)] by following the SIR method suggested by [18].In the following, we give a detailed estimation procedure utilizing the SIR scheme based on the observed data ( g i , y i ), i = 1, . . ., n : (a) Standardize y i by an affine transformation to get z i = �−1/2 yy (y i − y) , ( i = 1, . . ., n ), where ˆ yy and y are the sample covariance matrix and sample mean of y 1 , . . ., y n , respectively.(b) Divide the range of g into the H slices, I 1 , . . ., I H ; let the proportion of the g i 's fall- ing into the h-th slice be ph ( h = 1, . . ., H ), that is, ph =(1/n) n i=1 δ h g i , where δ h g i takes the value 0 or 1 depending on whether g i falls into the h-th slice I h or not.(c) Within each slice, compute the sample covariance of the z i 's, denoted by firstly, form the weighted mean value Ê = H h=1 ph vh ; next, find the eigenvalues and the eigenvectors for Ê.
(4) arg max (6) arg min When dividing the range of g , the most natural choice is to divide it into H = 3 k slices, considering the fact that each locus of the k SNPs takes values in {0, 1, 2} .However, if the dimension of genotypes is very high, then such a straightforward implementation, while theoretically possible, is intractable in practice.This is because there will be many empty slices due to a massive number of slices and the limited sample size, making it impossible to calculate the covariance in those empty slices.For this reason, we adopt an alternative way of dividing the range of g and grouping individuals, following the approach men- tioned in [21].Specifically, we first estimate the genetic relatedness matrix to measure genetic similarity among individuals and divide the range of g in terms of that similarity.
Next, we merge adjacent slices so that the number of individuals in each slice is not less than 5.Then, we calculate the conditional covariance of each slice according to the estimation procedure for E[Cov(z | g)] , which is described above.

SIR-based association test with reduced phenotypes
After estimating the d 0 standardized e.d.r directions, the q-dimensional y can be projected onto the d 0 -dimensional central subspace with d 0 ≪ q .Then, the predictor variable g is related to only d 0 linear combinations, S T 1 y, . . ., S T d 0 y , and it is sufficient to focus only on them.According to [17,18], it is fair to say that one-component model ( d 0 = 1 ) has pre- vailed, therefore, for the sake of simplicity, only case of d 0 = 1 is considered in this paper.
Consequently, the large dimensional phenotype y i (i = 1, . . ., n) can be transformed into ỹi ∈ R without loss of information on the corresponding genotype g i .At this point, we can investigate the relationship between phenotype y and genotype g in the following form: where Ẽ = (ε 1 , . . ., εn ) T is an n × 1 error vector with εi being the error term for i-th individual, Ỹ = (ỹ 1 , . . ., ỹn ) T is an n × 1 vector of the traits, β is a regression coefficient vector.
We aim to test whether the set of genetic markers is associated with phenotype after dimension reduction.This is equivalent to testing the null hypothesis H 0 : β = 0 against the alternative hypothesis H 1 that at least one element of β is not equal to zero.In this case, Wald-type statistic β no longer follows the chi-square distribution under the null hypothesis, where β is the MLE of β and Cov( β) is its covari- ance matrix.We use a permutation procedure to establish the null distribution of T .The permutation is done by randomly assigning the genotypes while keeping the phenotypes for each individual.For each permuted data set, we use (7) to calculate T as we have done by using the original data set.We repeat this procedure 1000 times to generate the distribution of T under the null hypothesis of no association between multiple geno- types and the phenotypes.This testing strategy, in the sense that it is about all coefficients, can be seen as a global test.( In addition to this, it is also possible to focus on the association of the single SNP after dimension reduction.To this end, we apply Bonferroni correction to adjust for multiple testing involving k markers, which is equivalent to testing the null hypothesis H 0 : βk = 0 against H a : βk � = 0 .The model for this case is where βk is a regression coefficient and εi is an error term for i-th individual.The test statistic T 2 k = β2 k /Var βk also does not follow the chi-square distribution under the null hypothesis, where βk is the MLE of βk and Var( βk ) is its variance.To carry out the association test, we apply the permutation procedure to estimate the distribution of Tk .

Simulation studies
We conduct a series of simulation studies to evaluate the numerical performance of the proposed association tests in comparison with eight other PC-based competing tests, such as PCA1, PCFisher, PCMinp, PCLC, Wald, WI, VC, and PCAQ [2].In these PC-based tests, the PCA1 indicates using only the first principal component, the PCFisher can be viewed as a nonlinear combination of the PC p values, the PCMinp uses the minimum PC p value as a testing statistic, and other tests aim at constructing the linear or quadratic combinations of PCs weighted by the functions of eigenvalues.We simulate the genotype g i = (g i1 , . . ., g ik ) T for the i-th individual at k SNPs, where the genotype of each SNP is sampled from a uniform distribution with a minor allele frequency (MAF=p) between 0.3 and 0.5 under the assumption of Hardy-Weinberg equilibrium.That is, p aa = (1 − p) 2 , p Aa = 2p(1 − p) , and p AA = p 2 .The q-dimensional phenotype y i of the i-th individual is generated from the model (1), where ε i follows N (0, �) with � lm = ρ |l−m| for 1 ≤ l, m ≤ q , and ρ is the correlation coefficient between phenotypes.Note that the simulated data under the null hypothesis of β = 0 can be used to calculate type I errors, whereas the data under the alter- native hypothesis saying that β contains at least one nonzero element can be used to calculate powers for each method.Hereafter, this global test mentioned above is expressed as SIR in this study.Alternatively, based on the model ( 8), we can also perform the association test for each SNP separately, and adjust the test for all the SNPs through multiple testing procedure, named SIR-S.
In the simulation studies, we consider three scenarios: Scenario 1 is for the lowdimensional phenotype (q = 5 and 10) and low-dimensional genotype (k = 5 and 10); Scenario 2 is for the high-dimensional phenotype (q = 50 and 100) but lowdimensional genotype (k = 10); Scenario 3 is for both high-dimensional phenotype (q = 50 and 100) and genotype (k = 40 and 100).We set the nominal level of significance α = 0.05 .Since the PC-based methods focus on the association test of single marker, here we apply Bonferroni correction to adjust for multiple testing involving k markers.In each scenario, we increase the correlation coefficient of phenotype in a series of ρ = 0, 0.2, 0.5, 0.7.For each scenario, we generate 100,000 and 1000 simu- lated data sets for type I error evaluation and for power calculation, respectively.(8) ỹi = g ik βk + εi , i = 1, . . ., n,

Scenario 1: low-dimensional phenotype and low-dimensional genotype
In this scenario, the dimension of phenotype is set to be q = 5 and 10, and the number of SNPs is set to be k = 5 and 10.We compare the power of each method in terms of the signal direction, signal strength, and the correlation structure among phenotypes.To this end, we consider different values of effect vector for each phenotype, specifically four cases for this scenario: Case 1 is for k = 5 , q = 5 ; Case 2 is for k = 5 , q = 10 ; Case 3 is for k = 10 , q = 5 ; Case 4 is for k = 10 , q = 10 .Here, we let most β j 's be zeros except for β 3 and β 4 being nonzeros.The effect vector of the third SNP β 3 on each phenotype is positive, except for Case 4, where its direction is mixed.The value of β 4 is given such that the fourth SNP is only associated with the second trait in all settings.The detailed setting of effect vectors is shown in Table 1.
From the results summarized in Table 2, it is apparent to see that the estimated type I error values of both SIR and SIR-S methods for different values of q, k, and ρ are very close to the true error level of α = 0.05 and the two methods have the well-controlled empirical type I error rates in most cases.For further comparisons, we also make the PC-based tests as additional baseline methods.Table 2 clearly shows that all the PCbased methods retain the empirical type I errors very well at the significance level in most cases.Notice that the type I error rate of the VC method has slightly conservative with the empirical type I error of 0.04237 when we set k = 10 and q = 5 .Overall, the SIR and SIR-S methods can accurately control the empirical type I errors at the nominal level.
We further compare the empirical powers of the proposed tests with the existing PCbased methods.For each setup, we generate n = 1000 and 2000 samples.The powers are calculated by the proportion of p values less than the significance level.We take the signal direction, signal strength, and the correlation structure among traits into account.Figures 1 and 2 show the powers of the ten comparative methods for different settings.We can see that the powers of the SIR and SIR-S methods are close to 1 and other PCbased methods are more powerless than the two methods in the case of k = 5 .In a nut- shell, with the same number of genotypes, if the dimension q is equal to 5, the powers of PC-based methods will decrease as the correlation coefficient increases, but if the dimension q is equal to 10, the power increases contrarily.However, the proposed methods still have much higher power than the other alternative methods.Different from the Table 1 Low-dimensional setting of effect vectors in Scenario 1 * The default value of other effect vectors β j 's are 0 Case 1 k=5, q=5 β 3 = (1.10,1.10, 1.10, 1.10, 1.10) β 4 = (0.00, 0.02, 0.00, 0.00, 0.00) Case 2 k=5, q=10 β 3 = (1.10,1.10, 1.10, 1.10, 1.10, 0.00, 0.00, 0.00, 0.00, 0.00) β 4 = (0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00) Case 3 k=10, q=5 β 3 = (1.10,1.10, 1.10, 1.10, 1.10) β 4 = (0.00, 0.02, 0.00, 0.00, 0.00) Case 4 k=10, q=10 β 3 = (1.10,−1.10, 1.10, −1.10, 1.10, 0.00, 0.00, 0.00, 0.00, 0.00) β 4 = (0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00) case of k = 5 , we can see that the powers of the SIR and SIR-S methods decrease as the dimension of genotype k increases.The PCFisher, Wald, and VC have comparable performances to the proposed tests when the effect vectors are in a mixed direction for a strong phenotypic correlation.From Figs. 1 and 2, we know that both the SIR and SIR-S methods are sensitive to the direction of the signal.The increase in sample size has little effect on the power of all methods.

Scenario 2: high-dimensional phenotype and low-dimensional genotype
To further show the performance of the proposed methods in the case of high-dimensional phenotype, we carry out additional simulations to compare our SIR and SIR-S methods with the other eight methods.Since Scenario 1 shows the sample size has little effect on the power for all methods, in this simulation, we only generate n=1000 individuals with different correlation structures of traits.The datasets are generated similarly to Scenario 1 except for the effect vectors.We consider three cases for this scenario: Case 1 is for k = 10 , q = 50 ; Case 2 is for k = 10 , q = 50 ; Case 3 is for k = 10 , q = 100 .The effect vector of the third SNP β 3 on the first five phenotypes is positive in Case 1 and Case 3, while the effect vectors in Case 2 have mixed directions.The setting of Table 3 The setting of effect vectors in Scenario 2 * The default value of other effect vectors β j 's are 0 Case 1 k = 10, q = 50 β 3 = c(1.10,1.10, 1.10, 1.10, 1.10, 0.00, 0.00, . . ., 0.00) β 4 = c(0.00,0.02, 0.00, 0.00, . . ., 0.00) Case 2 k = 10, q = 50 β 3 = c(1.10,−1.10, 1.10, −1.10, 1.10, 0.00, 0.00, . . ., 0.00) β 4 = c(0.00,0.02, 0.00, 0.00, . . ., 0.00) Case 3 k = 10, q = 100 β 3 = c(1.10,1.10, 1.10, 1.10, 1.10, 0.00, 0.00, . . ., 0.00) β 4 = c(0.00,0.02, 0.00, 0.00, . . ., 0.00) the different effect vectors is shown in Table 3. Table 4 summarizes the empirical type I errors of these methods for the association analysis.It is clear that all methods can control the empirical type I error well in most cases.Then, we compare the powers of our methods with the PC-based methods.When q = 50 , we have two cases and set the effects of the third SNP on the first five traits to be positive and mixed directions, respectively.In all cases, the fourth SNP is only associated with the second trait.Simulation results for power comparisons are shown in Fig. 3. Figure 3 shows that the powers of our SIR and SIR-S methods decrease when the effect vectors are in mixed directions for the high-dimensional phenotypes.However, effect vectors in mixed directions do not affect the power of the PC-based methods.From these observations, we can see that our SIR and SIR-S methods are sensitive to the direction of effect vector for high-dimensional phenotypes.Clearly, the powers of all methods are affected by the dimensional increase of the phenotype to a certain degree.Compare to Scenario 1, the powers of both the SIR and SIR-S methods are somewhat decreased, but still, our methods outperform the competing methods in most cases.
Fig. 3 The evolution of power along with the varying correlation ρ in Scenario 2

Scenario 3: high-dimensional phenotype and high-dimensional genotype
We conduct additional simulations to compare the performance of our proposed tests with existing PC-based methods for both high-dimensional phenotype and genotype.From Scenario 2, we know that when effect vectors are in mixed directions, the powers of our proposed methods decrease in a high-dimensional phenotype setting.For a fair comparison with the PC-based methods, in this simulation, we consider the effect vector of the third SNP β 3 on the first five phenotypes to be of mixed directions.Specifically, we consider high-dimensional phenotype and genotype with four cases: Case 1 is for k = 40 , q = 50 ; Case 2 is for k = 40 , q = 100 ; Case 3 is for k = 100 , q = 50 ; Case 4 is for k = 100 , q = 100 .Table 5 shows the setting of different effect vectors.The datasets are generated similarly to Scenario 1, but here the number of SNPs is k = 40 or 100.
Table 6 summarizes the simulation results for type I error estimates.It clearly shows that all methods can retain the empirical type I errors very well at the significance level.
Figure 4 presents the simulation results of power comparisons for all settings.The powers of our SIR and SIR-S methods are reduced as the dimensions of both genes and phenotypes increase.Nevertheless, the powers of the proposed methods are still slightly higher than the PC-based methods.

Scenario 4: simulation based on a real genotype data
In this section, we perform additional simulations to evaluate the performance of our SIR and SIR-S procedures on a more realistically simulated data, and compare with the other eight methods based on a real genotype data from the Genetic Analysis Workshop 17 (GAW17).The genotype data of 697 unrelated individuals are extracted from the sequence alignment files provided by the 1000 Genomes Project for their pilot3 study (http:// www.1000g enomes.org), in which we choose the TG gene as a candidate gene.The TG gene has 146 SNPs which encodes the thyroglobulin, one of the largest proteins in the human body, and mutation of the TG gene may cause hypothyroidism and autoimmune disorders [22].
In this simulation, the 100 dimensional phenotypes of the 697 individuals are generated from the model (1).To focus on the main points, six SNPs are selected as the causal variants.Specifically, the three SNPs, 20-th, 60-th, 100-th, are chosen to be far away and the others, 4-th, 6-th, 8-th, are chosen to be clustered.To consider the fact that the causal SNPs affect the disease in different directions, we set the effect vector of the each SNP β j on the first five phe- notypes to be of mixed directions, while the rest of them are set to be 0 .We generate 100,000 simulated data sets for type I error evaluation and 1000 data sets for power comparison.

Application to the sequencing data from ADNI
We analyze the ADNI1 and ADNI2 datasets from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.The ADNI seeks to develop biomarkers of the disease and advance the understanding of AD (Alzheimer's disease) pathophysiology, so as to improve diagnostic methods for early detection of AD and improve the clinical trial design.Additional goals are examining the rate of progress for both mild cognitive impairment and Alzheimer's disease, as well as building a large repository of clinical and imaging data.ADNI is a study that assesses the effects of genetic variants on AD and various AD-related outcomes, including 3D brain imaging and cognitive measurements [23].Proteolytic fragments of amyloid and post-translational modification of tau species in Cerebrospinal fluid (CSF) as well as cerebral amyloid deposition are important biomarkers for AD [24,25].A total of 800 subjects are included in the data, with 200 normal controls, 400 mild cognitive impairment (MCI), and 200 mild AD.We are interested in the association between genetic variants and five outcomes, including the hippocampus, entorhinal, amyloid beta (Aβ 42 ), tau, and phosphorylated tau (ptau 181 ) levels.It has been reported that the AOPE gene is related to AD and its associated outcomes [26].Therefore, as in [27], SNP rs769449 in gene AOPE is selected in our study.We also include 15 SNPs around rs769449 in our study: 8 SNPs on the left of rs769449 and 7 SNPs on the right, respectively.The SNPs rs8106922, rs1160985, and rs394819 are located in an intronic region of gene TOMM40, while other SNPs rs1081101, rs405509, and rs769449 are in the gene APOE, and rs445925 in gene APOC1.In the preprocessing step, we exclude the subjects which missing outcomes and genetic variants.After quality control, a total of 453 subjects are available in our study.
We conduct an association study to identify genetic factors influencing the five outcomes.All the aforementioned methods are performed with the nominal level of significance α = 0.05 .Since the PC-based methods focus on the association test of a single marker, here we apply Bonferroni correction to adjust for multiple testing involving 16 markers ( α Bonferroni = 0.05/16 = 0.0031 ).Four SNPs are detected by SIR method, including kgp8001324 ( p = 2.3 × 10 −2 ), rs405509 ( p = 5.3 × 10 −3 ), rs769449 ( p = 1.1 × 10 −3 ), and rs445925 ( p = 4.4 × 10 −2 ).Among them, two SNPs rs405509 and rs769449 are in the gene APOE, and rs445925 in the gene APOC1.Note that APOE and TOMM40 are well-known genes associated with AD [28].In particular, the SNPs kgp8001324, rs405509, and rs769449 are detected by our SIR method and other comparative methods, justifying the effectiveness of the proposed method.Meanwhile, the SNP rs445925 in gene APOC1 can be detected only by the SIR method, and APOC1 gene is reported to be a genetic risk factor for dementia and cognitive impairment in the elderly and it has a significant impact on hippocampal volumes [29].The p value of PCLC for detecting SNP kgp8001324 ( p = 7.31 × 10 −5 ) is more significant than the SIR.As for SNP rs405509 in the gene APOE, the p value of PC5 is 3 × 10 −3 similar to the SIR methods.The SNP rs769449 ( p = 1.01 × 10 −4 ) in the gene APOE is also detected by PC5.
The SNPs rs769449, rs405509, and kgp8001324 are detected by the SIR method as well as several comparative methods, which verifies the fact that these SNPs are associated with AD.In a nutshell, the SIR and PC5 methods, which detect four SNPs, perform better than other methods, but only the SIR method can detect one important SNP rs445925.In short, the SIR detects most SNPs across all cases, further confirming the advantages of the proposed method.We summarize a subset of the detected SNPs in Table 9.

Discussion
With the rapid development of next-generation sequence technologies, millions of SNPs and outcomes are usually collected in recent GWAS, and the high dimensionality of data has become a great challenge to statistical analysis.Furthermore, considering the complex correlations between multiple traits will be beneficial in revealing more latent information.In contrast to univariate analysis, multivariate analysis can exploit the correlations among phenotypes to improve power, in which a flexible framework is essential for testing the association between multiple predictors and multiple outcomes.
In this paper, we proposed a novel SIR-based association test that enables the analysis of multiple traits while taking into account the similarity between one or more traits to facilitate information borrowing.First, this procedure could preserve important information about the original regression between responses y and predictors g during car- rying out the dimension reduction.To this end, we divided the range of g according to genotype similarity and estimated the genetic relatedness matrix to measure genetic similarity between individuals during dimension reduction of phenotype y for the pro- posed method.Then, we assigned the individuals with similar genotypes to the same group, followed by conducting reduction steps, which significantly improved the computing speed.Second, several scenarios with low-and high-dimensional responses and genotypes were considered in our simulations.Our numerical studies illustrate that the powers of the SIR and SIR-S methods decrease as the genotype dimension k increases in low-dimensional phenotypes setting, where the PC-based methods exhibit comparable performances to our proposed method.In the high-dimensional phenotypes setting, we found that the direction of the effect vector has mixed direction, and the powers of proposed methods were reduced but with little effect on the PC-based methods.Finally, we conducted real-data analysis with five outcomes.Among several methods, the important SNP rs445925 in gene APOC1, which has a significant impact on hippocampal volumes, was detected only by our SIR-based method.Unlike the other methods, the SIR-based method also detected most SNPs across all cases.The analysis of ADNI data has shown that the proposed method can reveal biologically meaningful genetic markers with reasonable prediction accuracy and stability, providing suggestions for further clinical or epidemiological research.Through real-data analysis, we further confirmed that our method is more conducive to understanding the underlying genetic architecture in the multiple phenotype studies.Note that our method cannot be applied to GWAS data in that the model (7) is not suitable to it.Although we can test each SNP one by one based on the model ( 8) to perform GWAS by adjusting for multiple testing theoretically, the procedure of SIR-based dimension reduction of phenotype needs to merge adjacent slices based on the genetic relatedness matrix which is estimated through the empirical correlation between two individuals.Therefore, it becomes increasingly challenging to guarantee gene similarity when there are more SNPs.

Conclusion
There are still some problems not be worked out, which will be investigated in our upcoming research.Here, we adopted the SIR-based method to estimate the central subspace, but other methods such as the sliced average variance estimation (SAVE) [30] and the directional regression (DR) [31] are also worth trying in the future.In addition, this paper only considered the case of one component d 0 = 1 , but correlation analysis with multiple components can be similarly considered.We hope that the proposed methods can help in the search for genetic variants of complex diseases, and stimulate further interest and research in developing statistical methods for the analysis of next-generation sequence data.Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu)(National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012).As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report.A complete listing of ADNI investigators can be found at: http:// adni.loni.usc.edu/ wp-conte nt/ uploa ds/ how_ to_ apply/ ADNI_ Ackno wledg ement_ List.pdf.ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.
and tr(•) represents the sum of the eigenvalues of the matrix Cov[E(z | g)].
Fortunately, we see the relation between Cov[E(z | g)] and E[Cov(z | g)] as Alternatively, by applying the eigenvalue decomposition of E[Cov(z | g)] , we can deter- mine the standardized e.d.r.directions which are the eigenvectors associated with the d 0 smallest eigenvalues.This procedure equivalently derives a basis of central subspace by solving the solution of which is formed by the d 0 leading eigenvectors of E[Cov(z | g)].

Fig. 1 Fig. 2
Fig.1The evolution of power along with the varying correlation ρ in the case of n = 1000

Fig. 4
Fig.4 The evolution of power along with the varying correlation ρ in Scenario 3

Table 2
Empirical type I errors based on 100,000 replicates in Scenario 1

Table 4
Empirical type I errors based on 100,000 replicates in Scenario 2

Table 6
Empirical type I errors based on 100,000 replicates in Scenario 3 Table 7 lists the empirical type I errors of the ten methods of the association analysis for TG gene at the nominal level of 0.05.From Table 7, it is apparent to see that all the

Table 7
Empirical type I errors of the TG gene based on 100,000 replicates in Scenario 4

Table 8
Empirical powers of the TG gene based on 1000 replicates in Scenario 4 methods control the empirical type I errors of the TG gene very well.Table8shows the power comparison results of the ten methods for different settings.It clearly shows that all methods are robust to the proportion of the causal variants, and the SIR and SIR-S methods provide more power than the other methods in most cases.